Abstract

Predicting interactions between small molecules and proteins is a crucial ingredient of the 
drug discovery process. In particular, accurate predictive models are increasingly used to prese- 
lect potential lead compounds from large molecule databases, or to screen for side-effects. While 
classical in silico approaches focus on predicting interactions with a given specific target, new 
chemogenomics approaches adopt cross-target views. Building on recent developments in the use 
of kernel methods in bio- and chemoinformatics, we present a systematic framework to screen 
the chemical space of small molecules for interaction with the biological space of proteins. We 
show that this framework allows information sharing across the targets, resulting in a dramatic 
improvement of ligand prediction accuracy for three important classes of drug targets: enzymes, 
GPCR and ion channels. 



1 Introduction 



Predicting interactions between small molecules and proteins is a key element in the drug
discovery process. In particular, several classes of proteins such as G-protein-coupled receptors
(GPCR), enzymes and ion channels represent a large fraction of current drug targets and important
targets for new drug development (Hopkins and Groom, 2002). Understanding and predicting the
interactions between small molecules and such proteins could therefore help in the discovery of
new lead compounds.

Various approaches have already been developed and have proved very useful to address this
in silico prediction issue (Manly et al., 2001). The classical paradigm is to predict the modulators
of a given target, considering each target as a different problem. Usual methods are classified
into ligand-based and structure-based or docking approaches. Ligand-based approaches compare
a candidate ligand to the known ligands of the target to make their prediction, typically using
machine learning algorithms (Butina et al., 2002; Byvatov et al., 2003; Zernov et al., 2003), whereas
structure-based approaches use the 3D structure of the target to determine how well each candidate
binds the target (Halperin et al., 2002).



Ligand-based approaches require enough known ligands of a given target, relative to the
complexity of the ligand/non-ligand separation, to produce accurate predictors. If few or no ligands
are known for a target, one is compelled to use docking approaches, which in turn require the
3D structure of the target and are very time consuming. If no ligand is known for a target whose
3D structure is unavailable, none of the classical approaches can be applied. This is the case for
many GPCR, as very few structures have been crystallized so far (Ballesteros and Palczewski, 2001)
and many of these receptors, referred to as orphan GPCR, have no known ligand.

An interesting way to solve this problem is to cast it in the chemogenomics framework. Chemoge- 
nomics aims at mining the chemical space, which roughly corresponds to the set of all small 
molecules, for interactions with the biological space, i.e., the set of all proteins, in particular drug 
targets. A salient feature of the chemogenomics approach is the realization that some classes of 
molecules can bind "similar" proteins, suggesting that the knowledge of some ligands for a target 
can be helpful to determine ligands for similar targets. Besides, this type of method allows for a
more rational approach to drug design, since controlling the whole selectivity profile of a ligand is
crucial to make sure that no side effect occurs and that the compound is compatible with therapeutic
use.

Recent reviews (Kubinyi et al., 2004; Jaroch and Weinmann, 2006; Klabunde, 2007; Rognan,
2007) list several chemogenomic approaches to predict interactions between compounds and targets
(Oloff et al., 2006; Bock and Gough, 2005). Many of these chemogenomics methods rely on
some fixed choice of which targets should be used when learning a predictor for a given target,
the most extreme example being the learning of a predictor for a whole family or subfamily of
targets (Balakin et al., 2002; Klabunde, 2006). Most of them also need some specific procedure to
choose which ligands of the selected targets are used and how they are used.

We propose a method that uses existing and well tested machine learning algorithms, casting the 
interaction prediction problem in a joint ligand-target space. This embeds the sharing level threshold 
problem in a simple representation choice for which we also propose a systematic approach based 
on combinations of features of the ligand and features of the target. For the three families of targets 
of interest, we show that our approach outperforms the state-of-the-art individual SVM, and gives 
good performances even for targets with no known ligand. 



2 Method 

We formulate the typical in silico chemogenomics problem as the following learning problem: given a
collection of $n$ target/molecule pairs $(t_1, c_1), \ldots, (t_n, c_n)$ known to interact or not,
estimate a function $f(t, c)$ that would predict whether any chemical $c$ binds to any target $t$. In
this section we propose a rigorous and general framework to solve this problem, building on recent
developments of kernel methods in bio- and chemoinformatics.

2.1 From single-target screening to chemogenomics 

Much effort in chemoinformatics has been devoted to the more restricted problem of mining the
chemical space for interaction with a single target $t$, using a training set of molecules
$c_1, \ldots, c_n$ known to interact or not with the target. Machine learning approaches, such as
artificial neural networks (ANN) or support vector machines (SVM), often provide competitive models
for such problems. The simplest linear models start by representing each molecule $c$ by a vector
representation $\Phi(c)$, before estimating a linear function $f_t(c) = w_t^\top \Phi(c)$ whose sign
(positive or negative) is used to predict whether or not the small molecule $c$ is a ligand of the
target $t$. The weight vector $w_t$ is typically estimated based on its ability to correctly predict
the classes of molecules in the training set.

The in silico chemogenomics problem is more general because data involving interactions with
different targets are available to train a model which must be able to predict interactions between
any molecule and any protein. In order to extend the previous machine learning approaches to this
setting, we need to represent a pair $(t, c)$ of target $t$ and chemical $c$ by a vector
$\Phi(t, c)$, then estimate a linear function $f(t, c) = w^\top \Phi(t, c)$ whose sign is used to
predict whether or not $c$ can bind to $t$. As before, the vector $w$ can be estimated from the
training set of interacting and non-interacting pairs, using any linear machine learning algorithm.

To summarize, we propose to cast the in silico chemogenomics problem as a learning problem
in the ligand-target space, thus making it suitable for any classical linear machine learning approach
as soon as a vector representation $\Phi(t, c)$ is chosen for protein/ligand pairs. We propose in the
next sections a systematic way to design such a representation.



2.2 Vector representation of target/ligand pairs 




A large literature in chemoinformatics has been devoted to the problem of representing a molecule
$c$ by a vector $\Phi_{ligand}(c) \in \mathbb{R}^{d_c}$, e.g., using various molecular descriptors
(Todeschini and Consonni, 2002). These descriptors encode several features related to the
physico-chemical and structural properties of the molecules, and are widely used to model
interactions between the small molecules and a single target using linear models described in the
previous section (Gasteiger and Engel, 2003).



Similarly, much work in computational biology has been devoted to the construction of descriptors
for genes and proteins, in order to represent a given protein $t$ by a vector
$\Phi_{target}(t) \in \mathbb{R}^{d_t}$. The descriptors typically capture properties of the sequence
or structure of the protein, and can be used to infer models to predict, e.g., the structural or
functional class of a protein.

For our in silico chemogenomics problem we need to represent each pair $(c, t)$ of small molecule
and protein by a single vector $\Phi(c, t)$. In order to capture interactions between features of the
molecule and of the protein that may be useful predictors for the interaction between $c$ and $t$, we
propose to consider features for the pair $(c, t)$ obtained by multiplying a descriptor of $c$ with a
descriptor of $t$. Intuitively, if for example the descriptors are binary indicators of specific
structural features in each small molecule and protein, then the product of two such features
indicates that both the small molecule and the target carry specific features, which may be strongly
correlated with the fact that they interact. More generally, if a molecule $c$ is represented by a
vector of descriptors $\Phi_{ligand}(c) \in \mathbb{R}^{d_c}$ and a target protein by a vector of
descriptors $\Phi_{target}(t) \in \mathbb{R}^{d_t}$, this suggests representing the pair $(c, t)$ by
the set of all possible products of features of $c$ and $t$, i.e., by the tensor product:

$$\Phi(c, t) = \Phi_{ligand}(c) \otimes \Phi_{target}(t) . \qquad (1)$$

Recall that the tensor product in (1) is a $d_c \times d_t$ vector whose $(i, j)$-th entry is exactly
the product of the $i$-th entry of $\Phi_{ligand}(c)$ by the $j$-th entry of $\Phi_{target}(t)$. This
representation can be used to combine in a principled way any vector representation of small
molecules with any vector representation of proteins, for the purpose of in silico chemogenomics or
any other task involving molecule/protein pairs. A potential issue with this approach, however, is
that the size of the vector representation for a pair may be prohibitively large for practical
computation and storage. For example, using a vector of molecular descriptors of size 1024 for
molecules and representing a protein by the vector of counts of all 2-mers of amino acids in its
sequence ($d_t = 20 \times 20 = 400$) results in more than 400k dimensions for the representation of
a pair. In order to circumvent this issue we now show how kernel methods such as SVM can efficiently
work in such large spaces.
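As an illustration of the pair representation in (1), the following minimal sketch builds the tensor product of two small descriptor vectors in plain Python. The function name `tensor_product` and the toy descriptor values are assumptions made for this example only; they are not taken from the experiments.

```python
def tensor_product(phi_ligand, phi_target):
    """Flattened tensor product: one entry per (ligand feature, target feature) pair."""
    return [a * b for a in phi_ligand for b in phi_target]

# Toy binary descriptors (hypothetical values, for illustration only).
phi_c = [1, 0, 1]  # d_c = 3 ligand features
phi_t = [0, 1]     # d_t = 2 target features

# The pair vector has d_c * d_t = 6 entries; an entry is 1 only when
# both the ligand feature and the target feature are present.
phi_pair = tensor_product(phi_c, phi_t)

# Size of the pair representation for the example given in the text:
# 1024 molecular descriptors times 20 x 20 amino-acid 2-mer counts.
pair_dim = 1024 * (20 * 20)
```

With the dimensions quoted in the text, `pair_dim` evaluates to 409,600, which is the "more than 400k" figure.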



2.3 Kernels for target/ligand pairs 

SVM is an algorithm to estimate linear binary classifiers from a training set of patterns with known
class (Boser et al., 1992; Vapnik, 1998). A salient feature of SVM, often referred to as the kernel
trick, is its ability to process large- or even infinite-dimensional patterns as soon as the inner
product between any two patterns can be efficiently computed. This property is shared by
a large number of popular linear algorithms, collectively referred to as kernel methods, including
for example algorithms for regression, clustering or outlier detection (Scholkopf and Smola, 2002;
Shawe-Taylor and Cristianini, 2004).

In order to apply kernel methods such as SVM for in silico chemogenomics, we therefore need
to show how to efficiently compute the inner product between the vector representations of two
molecule/protein pairs. Interestingly, a classical and easy to check property of tensor products
allows us to write the inner product between two tensor product vectors as a product of inner
products:

$$\left(\Phi_{ligand}(c) \otimes \Phi_{target}(t)\right)^\top \left(\Phi_{ligand}(c') \otimes \Phi_{target}(t')\right) = \Phi_{ligand}(c)^\top \Phi_{ligand}(c') \times \Phi_{target}(t)^\top \Phi_{target}(t') . \qquad (2)$$

This factorization dramatically reduces the burden of working with tensor products in large
dimensions. For example, in our previous example where the small molecules and the proteins are
represented by vectors of respective dimensions 1024 and 400, the inner product in more than 400k
dimensions between tensor products is simply obtained from (2) by computing two inner products,
respectively in dimensions 1024 and 400, before taking their product.
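The factorization in (2) can be checked numerically. The sketch below compares the inner product computed explicitly in the tensor-product space with the product of the two small inner products; the helper names and the random toy vectors are illustrative assumptions.

```python
import math
import random

def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def tensor_product(u, v):
    """Flattened tensor product of two vectors."""
    return [a * b for a in u for b in v]

rng = random.Random(0)
d_c, d_t = 5, 4  # small toy dimensions instead of 1024 and 400

phi_c = [rng.random() for _ in range(d_c)]   # ligand c
phi_c2 = [rng.random() for _ in range(d_c)]  # ligand c'
phi_t = [rng.random() for _ in range(d_t)]   # target t
phi_t2 = [rng.random() for _ in range(d_t)]  # target t'

# Left-hand side of (2): inner product in d_c * d_t dimensions.
explicit = dot(tensor_product(phi_c, phi_t), tensor_product(phi_c2, phi_t2))

# Right-hand side of (2): two small inner products, then one multiplication.
factorized = dot(phi_c, phi_c2) * dot(phi_t, phi_t2)

identity_holds = math.isclose(explicit, factorized, rel_tol=1e-9)
```

The explicit route costs on the order of d_c x d_t multiplications per pair of pairs, while the factorized route costs only d_c + d_t + 1.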

Even more interestingly, this reasoning extends to the case where inner products between vector
representations of small molecules and proteins can themselves be efficiently computed with the
help of positive definite kernels (Vapnik, 1998), as explained in the next sections. Positive definite
kernels are linked to inner products by a fundamental result (Aronszajn, 1950): the kernel between
two points is equivalent to an inner product between the points mapped to a Hilbert space uniquely
defined by the kernel. Now by denoting

$$K_{ligand}(c, c') = \Phi_{ligand}(c)^\top \Phi_{ligand}(c') , \qquad K_{target}(t, t') = \Phi_{target}(t)^\top \Phi_{target}(t') ,$$

we obtain the inner product between tensor products by:

$$K\left((c, t), (c', t')\right) = K_{target}(t, t') \times K_{ligand}(c, c') . \qquad (3)$$



In summary, as soon as two kernels $K_{ligand}$ and $K_{target}$ corresponding to two implicit
embeddings of the chemical and biological spaces in two Hilbert spaces are chosen, we can solve the
in silico chemogenomics problem with an SVM (or any other relevant kernel method) using the product
kernel (3) between pairs. The particular kernels $K_{ligand}$ and $K_{target}$ should ideally encode
properties related to the ability of similar molecules to bind similar targets or ligands respectively.
We review in the next two sections possible choices for such kernels.
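A minimal sketch of how the product kernel (3) can be assembled from any two kernels is given below. The names `product_kernel` and `gram_matrix`, the use of plain inner products as stand-ins for both kernels, and the toy pairs are assumptions of this example, not the setup used in the experiments.

```python
def dot(u, v):
    """Plain inner product, used here as a stand-in for both kernels."""
    return sum(a * b for a, b in zip(u, v))

def product_kernel(k_ligand, k_target):
    """Build K((c, t), (c', t')) = K_target(t, t') * K_ligand(c, c')."""
    def k_pair(pair_a, pair_b):
        (c, t), (c2, t2) = pair_a, pair_b
        return k_target(t, t2) * k_ligand(c, c2)
    return k_pair

def gram_matrix(pairs, k_pair):
    """Kernel values between all (ligand, target) pairs of a data set."""
    return [[k_pair(a, b) for b in pairs] for a in pairs]

# Hypothetical (ligand descriptor, target descriptor) pairs.
pairs = [
    ((1, 0, 1), (1, 0)),
    ((1, 1, 0), (1, 0)),
    ((0, 1, 1), (0, 1)),
]
k_pair = product_kernel(dot, dot)
gram = gram_matrix(pairs, k_pair)
```

A Gram matrix of this kind is what an SVM implementation that accepts precomputed kernels would consume (for instance scikit-learn's `SVC(kernel="precomputed")`, mentioned here only as one possible choice).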



2.4 Kernels for ligands 



Recent years have witnessed impressive advances in the use of SVM in chemoinformatics (Ivanciuc,
2007). In particular much work has focused on the development of kernels for small molecules
for the purpose of single-target virtual screening and prediction of pharmacokinetics and toxicity.
For example, simple inner products between vectors of classical molecular descriptors have
been widely investigated, including physicochemical properties of molecules or 2D and 3D
fingerprints (Todeschini and Consonni, 2002; Azencott et al., 2007). Other kernels have been designed
directly from the comparison of 2D and 3D structures of molecules, including kernels based on the
detection of common substructures in the 2D structures of molecules seen as graphs (Kashima et al.,
2003, 2004; Gartner et al., 2003; Ramon and Gartner, 2003; Mahe et al., 2005; Ralaivola et al., 2005;
Borgwardt and Kriegel, 2005; Horvath et al., 2004; Mahe and Vert, 2006) or on the encoding of
various properties of the 3D structure of molecules (Mahe et al., 2006; Azencott et al., 2007).

While any of these kernels could be used to model the similarities of small molecules and be
plugged into (3), we restrict ourselves in our experiment to a particular kernel proposed by
Ralaivola et al. (2005) called the Tanimoto kernel, a classical choice that usually gives
state-of-the-art performance in molecule classification tasks. It is defined as:



$$K_{ligand}(c, c') = \frac{\Phi_{ligand}(c)^\top \Phi_{ligand}(c')}{\Phi_{ligand}(c)^\top \Phi_{ligand}(c) + \Phi_{ligand}(c')^\top \Phi_{ligand}(c') - \Phi_{ligand}(c)^\top \Phi_{ligand}(c')} , \qquad (4)$$



where $\Phi_{ligand}(c)$ is a binary vector whose bits indicate the presence or absence of all linear
paths of length $l$ or less as subgraphs of the 2D structure of $c$. We chose $l = 8$ in our
experiment, i.e., we characterize the molecules by the occurrences of linear subgraphs of length 8 or
less, a value previously observed to give good results in several virtual screening tasks (Mahe et
al.). We used the freely and publicly available ChemCPP software (footnote 1) to compute this kernel
in the experiments.
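For binary vectors, the inner products in (4) reduce to set sizes: the numerator counts the features shared by both molecules, and the denominator counts the features present in at least one of them. The sketch below uses this set formulation. It is not the ChemCPP implementation and does not enumerate paths in a molecular graph; the path labels are hypothetical, and returning 0 for two empty feature sets is an arbitrary convention chosen for this example.

```python
def tanimoto_kernel(features_a, features_b):
    """Tanimoto kernel between two molecules given as sets of present features."""
    a, b = set(features_a), set(features_b)
    common = len(a & b)                # Phi(c)^T Phi(c')
    denom = len(a) + len(b) - common   # |a| + |b| - |a & b| = |a | b|
    # Convention assumed here: two molecules without any feature get 0.
    return common / denom if denom else 0.0

# Hypothetical linear-path features of two small molecules.
mol_1 = {"C", "C-C", "C-O"}
mol_2 = {"C", "C-O", "O"}

k_12 = tanimoto_kernel(mol_1, mol_2)  # 2 shared features out of 4 distinct ones
k_11 = tanimoto_kernel(mol_1, mol_1)  # a molecule compared with itself
```

The kernel takes values between 0 (no shared path) and 1 (identical fingerprints).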



2.5 Kernels for targets 

SVM and kernel methods are also widely used in bioinformatics (Scholkopf et al., 2004), and
a variety of approaches have been proposed to design kernels between proteins, ranging from
kernels based on the amino-acid sequence of a protein (Jaakkola et al., 2000; Leslie et al., 2002;
Tsuda et al., 2002; Ben-Hur and Brutlag, 2003; Leslie et al., 2004; Vert et al., 2004; Kuang et al.,
2005; Cuturi and Vert, 2005) to kernels based on the 3D structures of proteins (Dobson and Doig,
2005; Borgwardt et al., 2005; Qiu et al., 2007) or the pattern of occurrences of proteins in multiple
sequenced genomes (Vert, 2002). These kernels have been used in conjunction with SVM or other
kernel methods for various tasks related to structural or functional classification of proteins. While
any of these kernels can theoretically be used as a target kernel in (3), we investigate in this
paper a restricted list of specific kernels described below, aimed at illustrating the flexibility of
our framework and testing various hypotheses.



• The Dirac kernel between two targets $t, t'$ is:

$$K_{Dirac}(t, t') = \begin{cases} 1 & \text{if } t = t' , \\ 0 & \text{otherwise.} \end{cases} \qquad (5)$$

This basic kernel simply represents different targets as orthonormal vectors. From (3) we
see that orthogonality between two proteins $t$ and $t'$ implies orthogonality between all pairs
$(c, t)$ and $(c', t')$ for any two small molecules $c$ and $c'$. This means that a linear classifier
for pairs $(c, t)$ with this kernel decomposes as a set of independent linear classifiers for
interactions between molecules and each target protein, which are trained without sharing any
information of known ligands between different targets. In other words, using the Dirac kernel for
proteins amounts to performing classical learning independently for each target, which is our
baseline approach.

• The multitask kernel between two targets $t, t'$ is defined as:

$$K_{multitask}(t, t') = 1 + K_{Dirac}(t, t') .$$

This kernel, originally proposed in the context of multitask learning by Evgeniou et al. (2005),
removes the orthogonality of different proteins to allow sharing of information. As explained in
Evgeniou et al. (2005), plugging $K_{multitask}$ in (3) amounts to decomposing the linear function
used to predict interactions as a sum of a linear function common to all targets and of a linear
function specific to each target:

$$f(c, t) = w^\top \Phi(c, t) = w_{general}^\top \Phi_{ligand}(c) + w_t^\top \Phi_{ligand}(c) .$$

A consequence is that only data related to the target $t$ are used to estimate the specific
vector $w_t$, while all data are used to estimate the common vector $w_{general}$. In our framework
this classifier is therefore the combination of a target-specific part accounting for target-specific
properties of the ligands and a global part accounting for general properties of the ligands
across the targets. The latter term allows information to be shared during the learning process,
while the former ensures that specificities of the ligands for each target are not lost.

Footnote 1: ChemCPP is available at http://chemcpp.sourceforge.net

• While the multitask kernel provides a basic framework to share information across proteins, it
does not allow known interactions with a protein $t$ to be weighted differently when they
contribute to predicting interactions with a target $t'$. Empirical observations underlying
chemogenomics, on the other hand, suggest that molecules binding a target $t$ are only likely to
bind targets $t'$ similar to $t$ in terms of structure or evolutionary history. In terms of kernels
this suggests plugging into (3) a kernel for proteins that quantifies this notion of similarity
between proteins, which can for example be detected by comparing the sequences of proteins. In
order to test this approach, we therefore tested two commonly used kernels between protein
sequences: the mismatch kernel (Leslie et al., 2004), which compares proteins in terms of common
short sequences of amino acids up to some mismatches, and the local alignment kernel (Vert et al.,
2004), which measures the similarity between proteins as an alignment score between their
primary sequences. In our experiments involving the mismatch kernel, we use the classical
choice of 3-mers with a maximum of 1 mismatch, and for the datasets where some sequences
were not available in the database, we added $K_{Dirac}(t, t')$ to the kernel (and normalized it
to 1 on the diagonal) in order to keep it valid.

• Alternatively we propose a new kernel aimed at encoding the similarity of proteins with
respect to the ligands they bind. Indeed, for most major classes of drug targets such as
the ones investigated in this study (GPCR, enzymes and ion channels), proteins have been
organized into hierarchies that typically describe the precise functions of the proteins within
each family. Enzymes are labeled with Enzyme Commission numbers (EC numbers) defined
in International (1992), that classify the chemical reaction they catalyze, forming a 4-level
hierarchy encoded into 4 numbers. For example EC 1 includes oxidoreductases, EC 1.2
includes oxidoreductases that act on the aldehyde or oxo group of donors, EC 1.2.2 is a
subclass of EC 1.2 with a cytochrome as acceptor and EC 1.2.2.1 is a subgroup of
enzymes catalyzing the oxidation of formate to bicarbonate. These numbers define a natural
and very informative hierarchy on enzymes: one can expect that enzymes that are closer in
the hierarchy will tend to have more similar ligands. Similarly, GPCRs are grouped into 4
classes based on sequence homology and functional similarity: the rhodopsin family (class A),
the secretin family (class B), the metabotropic family (class C) and a last class regrouping
more diverse receptors (class D). The KEGG database (Kanehisa et al., 2002) subdivides
the large rhodopsin family in three subgroups (amine receptors, peptide receptors and other
receptors) and adds a second level of classification based on the type of ligands or known
subdivisions. For example, the rhodopsin family with amine receptors is subdivided into
cholinergic receptors, adrenergic receptors, etc. This also defines a natural hierarchy that we
could use to compare GPCRs. Finally, KEGG also provides a classification of ion channels.
Classification of ion channels is a less simple task since some of them can be classified according
to different criteria like voltage dependence or ligand-gating. The classification proposed by
KEGG includes Cys-loop superfamily, glutamate-gated cation channels, epithelial and related
Na+ channels, voltage-gated cation channels, related to voltage-gated cation channels, related
to inward rectifier K+ channels, chloride channels and related to ATPase-linked transporters,
and each of these classes is further subdivided according for example to the type of ligands (e.g.,
glutamate receptor) or to the type of ion passing through the channel (e.g., Na+ channel).
Here again, this hierarchy can be used to define a meaningful similarity in terms of interaction
behavior.

For each of the three target families, we define the hierarchy kernel between two targets of the
family as the number of common ancestors in the corresponding hierarchy plus one, that is,

$$K_{hierarchy}(t, t') = \langle \Phi_h(t), \Phi_h(t') \rangle ,$$

where $\Phi_h(t)$ contains as many features as there are nodes in the hierarchy, each being set to 1
if the corresponding node is part of $t$'s hierarchy and 0 otherwise, plus one feature constantly
set to one that accounts for the "plus one" term of the kernel.
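The Dirac, multitask and hierarchy kernels described in this section can be sketched in a few lines. In this illustration a target is identified by an arbitrary label for the first two kernels, and by its path in the hierarchy (for instance the four fields of an EC number) for the third; counting every level of the path, including the last one, as a node of the target's hierarchy is an assumption of this sketch, as are the function names and the example paths.

```python
def dirac_kernel(t, t2):
    """1 if the two targets are the same, 0 otherwise (no information sharing)."""
    return 1.0 if t == t2 else 0.0

def multitask_kernel(t, t2):
    """Uniform sharing across targets: 1 + Dirac."""
    return 1.0 + dirac_kernel(t, t2)

def hierarchy_nodes(path):
    """Nodes on the path from the root to a target, encoded as prefixes of the path."""
    return {tuple(path[:i]) for i in range(1, len(path) + 1)}

def hierarchy_kernel(path_a, path_b):
    """Number of hierarchy nodes shared by the two targets, plus one."""
    return len(hierarchy_nodes(path_a) & hierarchy_nodes(path_b)) + 1

# Hypothetical enzymes described by EC-like paths.
enzyme_a = (1, 2, 2, 1)
enzyme_b = (1, 2, 1, 3)   # shares the first two levels with enzyme_a
enzyme_c = (3, 4, 21, 5)  # shares nothing with enzyme_a

k_ab = hierarchy_kernel(enzyme_a, enzyme_b)
k_ac = hierarchy_kernel(enzyme_a, enzyme_c)
k_aa = hierarchy_kernel(enzyme_a, enzyme_a)
```

For two unrelated targets the hierarchy kernel equals 1, like the multitask kernel, so some information is always shared; closer targets in the hierarchy receive larger values and therefore share more.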



3 Data 

We extracted compound interaction data from the KEGG BRITE Database (Kanehisa et al., 2002,
2004) concerning enzymes, GPCR and ion channels, three target classes particularly relevant for novel
drug development.

For each family, the database provides a list of known compounds for each target. Depending on 
the target families, various categories of compounds are defined to indicate the type of interaction 
between each target and each compound. These are for example inhibitor, cofactor and effector for 
enzyme ligands, antagonist or (full/partial) agonist for GPCR and pore blocker, (positive/negative) 
allosteric modulator, agonist or antagonist for ion channels. The list is not exhaustive for the latter 
since numerous categories exist. Although different types of interactions on a given target might 
correspond to different binding sites, it is theoretically possible for a non-linear classifier like SVM 
with non-linear kernels to learn classes consisting of several disconnected sets. Therefore, for the 
sake of clarity of our analysis, we do not differentiate between the categories of compounds. 

We eliminated all compounds for which no molecular descriptor was available (principally peptide
compounds), and all the targets for which no compound was known. For each target, we generated
as many negative ligand-target pairs as we had known ligands forming positive pairs by combining
the target with a ligand randomly chosen among the other targets' ligands (excluding those that
were known to interact with the given target). This protocol may generate false negatives, since
some ligands could actually interact with the target although they have not been experimentally
tested; our method could benefit from experimentally confirmed negative pairs.
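The negative-pair generation described above can be sketched as follows. The function name, the toy interaction data and the fixed random seed are assumptions of this example; in particular, the text does not say whether decoy ligands are drawn with or without replacement, and this sketch draws them with replacement.

```python
import random

def sample_negative_pairs(ligands_by_target, seed=0):
    """For each target, draw one decoy per known ligand among other targets' ligands."""
    rng = random.Random(seed)
    negatives = []
    for target, known in ligands_by_target.items():
        known_set = set(known)
        # Ligands of the other targets, excluding those known to bind this target.
        others = {c for t, ligs in ligands_by_target.items() if t != target for c in ligs}
        candidates = sorted(others - known_set)
        if not candidates:
            continue  # no decoy available for this target
        # Assumption: decoys are drawn with replacement.
        negatives.extend((target, rng.choice(candidates)) for _ in known)
    return negatives

# Hypothetical known interactions: target -> list of ligands.
ligands_by_target = {"t1": ["a", "b"], "t2": ["b", "c"], "t3": ["d"]}
positives = [(t, c) for t, ligs in ligands_by_target.items() for c in ligs]
negatives = sample_negative_pairs(ligands_by_target)
```

By construction the resulting data set is balanced for each target, which is why an accuracy of 0.5 corresponds to random behavior in the results.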

This resulted in 2436 data points for enzymes (1218 known enzyme-ligand pairs and 1218 gen- 
erated negative points) representing interactions between 675 enzymes and 524 compounds, 798 
training data points for GPCRs representing interactions between 100 receptors and 219 com- 
pounds and 2330 ion channel data points representing interactions between 114 channels and 462 
compounds. Besides, Figure 1 shows the distribution of the number of known ligands per target for
each dataset and illustrates the fact that for most of them, few compounds are known.

For each target $t$ in each family, we carried out two experiments. First, all data points
corresponding to other targets in the family were used for training only, and the $n_t$ points
corresponding to $t$ were split into $k$ folds with $k = \min(n_t, 10)$. That is, for each fold, an SVM
classifier was trained on all points involving other targets of the family plus a fraction of the
points involving $t$, then the performance of the classifier was tested on the remaining fraction of
data points for $t$. This protocol is intended to assess the impact of using ligands from other
targets on the accuracy of the learned classifier for a given target. Second, for each target $t$ we
learned an SVM classifier using only interactions that did not involve $t$ and tested it on the points
that involved $t$. This is intended to simulate the behavior of our framework when making predictions
for orphan targets, i.e., for targets for which no ligand is known.

Figure 1: Distribution of the number of training points for a target for the enzymes, GPCR and
ion channel datasets.

For the first protocol, since learning an SVM with only one training point does not really make
sense and can lead to "anti-learning" with performance below 0.5, we set all results $r$ involving
the Dirac target kernel on targets with only 1 known ligand to $\max(r, 0.5)$. This is to avoid any
artefactual penalization of the Dirac approach and to make sure we measure the actual improvement
brought by sharing information across targets.
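The fold construction of the first protocol and the correction applied to the Dirac baseline can be sketched as below. The round-robin assignment of points to folds and the function names are assumptions of this example; the text only specifies the number of folds, $k = \min(n_t, 10)$, and the $\max(r, 0.5)$ rule.

```python
def target_folds(n_t, max_folds=10):
    """Split the indices of a target's n_t points into k = min(n_t, max_folds) folds."""
    k = min(n_t, max_folds)
    # Assumption: round-robin assignment; the text does not specify how folds are drawn.
    return [list(range(i, n_t, k)) for i in range(k)]

def train_test_indices(n_t, max_folds=10):
    """Yield (train, test) index lists over the target's own points, one per fold.

    In the first protocol, the points of all other targets of the family
    are added to every training set.
    """
    folds = target_folds(n_t, max_folds)
    for test in folds:
        train = [i for fold in folds if fold is not test for i in fold]
        yield train, test

def clamp_dirac_result(r, n_known_ligands):
    """Floor the Dirac baseline at 0.5 for targets with a single known ligand."""
    return max(r, 0.5) if n_known_ligands == 1 else r

folds_small = target_folds(3)   # fewer than 10 points: one point per fold
folds_large = target_folds(25)  # 10 folds of 2 or 3 points
splits = list(train_test_indices(4))
```

For a target with fewer than 10 points the scheme reduces to leave-one-out over that target's points.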



4 Results 



We first present the results obtained on the three datasets for the first experiment, assessing how
using training points from other targets of the family improves prediction accuracy with respect to
individual (Dirac-based) learning. Table 1 shows the mean success rate across the family targets
for an SVM with a product kernel using the Tanimoto kernel for ligands and various kernels for
proteins. For the enzymes and ion channels datasets, we observe significant improvements when the
multitask kernel is used in place of the Dirac kernel, on the one hand, and when the hierarchy kernel
replaces the multitask kernel, on the other hand. For example, the Dirac kernel only performs at
an average accuracy of 70% for the ion channel dataset, while the multitask kernel increases the
accuracy to 80% and the hierarchy kernel brings it to 86%. For the enzymes, a global improvement
of 34.1 percentage points is observed between the Dirac and the hierarchy approaches. This clearly
demonstrates the benefits of sharing information among known ligands of different targets, on the
one hand, and the relevance of incorporating prior information into the kernels, on the other hand.

Figure 2: Target kernel Gram matrices ($K_{tar}$) for ion channels with multitask, hierarchy and local
alignment kernels.

On the GPCR dataset though, the multitask kernel performs worse than the Dirac kernel, 
probably because some targets in different subclasses show very different binding behavior which 
results in adding more noise than information when sharing naively with this kernel. However a 
more careful handling of the similarities between GPCRs through the hierarchy kernel again results 
in significant improvement over the Dirac kernel (from 68.2% to 81.7%), again demonstrating the 
relevance of the approach. 

Sequence-based target kernels do not achieve the same performance as the hierarchy kernel,
although they perform relatively well for the ion channel dataset. In the case of enzymes, this can
be explained by the diversity of the proteins in the family, and for the GPCR, by the well known
fact that the receptors do not share overall sequence homology (Gether, 2000). Figure 2 shows 3
of the tested target kernels for the ion channel dataset. The hierarchy kernel adds some structure
information with respect to the multitask kernel, which explains the success rate increase. The
local alignment sequence-based kernel fails to precisely rebuild this structure but retains some
substructures. In the cases of GPCR and enzymes, almost no structure is found by the sequence
kernels, which, as alluded to above, was to be expected and suggests that more subtle comparison of
the sequences would be required to exploit the information they contain.

Figure 3 illustrates the influence of the number of training points for a target on the improvement
brought by using information from similar targets. As one could expect, the improvement is very
strong when few ligands are known and decreases when enough training points become available.
After a certain point (around 30 training points), using similar targets can even deteriorate the
performance. This suggests that the method could be globally improved by learning for each
target independently how much information should be shared, for example through kernel learning
approaches (Lanckriet et al., 2004).



K_tar \ Target     Enzymes          GPCR             Channels
Dirac              0.536 ± 0.005    0.682 ± 0.022    0.701 ± 0.017
multitask          0.874 ± 0.008    0.595 ± 0.030    0.797 ± 0.017
hierarchy          0.877 ± 0.008    0.817 ± 0.025    0.857 ± 0.015
mismatch           0.582 ± 0.008    0.638 ± 0.030    0.811 ± 0.016
local alignment    0.544 ± 0.007    0.696 ± 0.033    0.824 ± 0.015



Table 1: Prediction accuracy for the first protocol on each dataset with various target kernels. 

Figure 3: Relative improvement of the hierarchy kernel against the Dirac kernel as a function of 
the number of known ligands for enzymes, GPCR and ion channel datasets. 

The second experiment pushes this observation to its limit by assessing how well each strategy 
can predict ligands for proteins with no known ligand. Table 2 shows the results in that 
case. As expected, the classifiers using Dirac kernels show random behavior, since using 
a Dirac kernel with no data for the target amounts to learning with no training data at all. On 
the other hand, it is still possible to obtain reasonable results using adequate target 
kernels. In particular, compared to the first experiment, where known ligands were used, the 
hierarchy kernel loses only 5.2% for the ion channel dataset, 4.1% for the GPCR dataset and 1.5% 
for the enzyme dataset. This suggests that if a target with no known compound can be placed in the 
hierarchy, e.g., in the case of GPCR, through homology detection with known members of the family 
using specific GPCR alignment algorithms (Kratochwil et al.) or fingerprint analysis (Attwood et 
al., 2003), it is possible to predict some of its ligands almost as accurately as if some of them 
were already available. 
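
The contrast between the Dirac kernel and a hierarchy-aware kernel for a target without known ligands can be sketched on a toy example. Everything below is an assumption made for illustration: the joint kernel is taken to be the product of a ligand kernel and a target kernel, a plain kernel vote stands in for the SVM, the hierarchy kernel is a toy version counting shared hierarchy levels plus one, and the fingerprints, targets and labels are made up.

```python
# Hypothetical hierarchy: each target is described by its ancestor levels.
HIERARCHY = {
    "A1": ("classA", "sub1"),
    "A2": ("classA", "sub1"),   # new target, no training ligand
    "B1": ("classB", "sub9"),
}

def dirac_kernel(t, u):
    """No sharing: two targets are similar only if they are identical."""
    return 1.0 if t == u else 0.0

def hierarchy_kernel(t, u):
    """Toy hierarchy kernel: 1 + number of shared leading hierarchy levels."""
    shared = 0
    for a, b in zip(HIERARCHY[t], HIERARCHY[u]):
        if a != b:
            break
        shared += 1
    return 1.0 + shared

def ligand_kernel(x, y):
    """Placeholder ligand kernel: dot product of binary fingerprints."""
    return float(sum(a * b for a, b in zip(x, y)))

def score(ligand, target, training, target_kernel):
    """Kernel vote in the joint space, using a product kernel."""
    return sum(
        label * ligand_kernel(ligand, lig) * target_kernel(target, tar)
        for lig, tar, label in training
    )

X, Y = (1, 1, 0, 0), (0, 0, 1, 1)   # two toy ligand fingerprints
TRAINING = [(X, "A1", +1), (Y, "A1", -1), (Y, "B1", +1), (X, "B1", -1)]

for name, kernel in (("Dirac", dirac_kernel), ("hierarchy", hierarchy_kernel)):
    print(name, score(X, "A2", TRAINING, kernel), score(Y, "A2", TRAINING, kernel))
```

With the Dirac kernel every training pair has zero similarity to the new target `A2`, so both candidate ligands receive the same score and cannot be ranked. With the toy hierarchy kernel, `A2` borrows mostly from its sibling `A1`, and the two ligands receive scores of opposite sign.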



Ktar \ Target      Enzymes          GPCR             Channels
Dirac              0.500 ± 0.000    0.500 ± 0.000    0.500 ± 0.000
multitask          0.856 ± 0.009    0.477 ± 0.025    0.636 ± 0.021
hierarchy          0.862 ± 0.009    0.776 ± 0.026    0.805 ± 0.018
mismatch           0.569 ± 0.007    0.579 ± 0.028    0.671 ± 0.020
local alignment    0.521 ± 0.004    0.647 ± 0.030    0.722 ± 0.019



Table 2: Prediction accuracy for the second protocol on each dataset with various target kernels. 
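
The losses quoted in the text for the hierarchy kernel can be rechecked from Tables 1 and 2. The snippet below is only an arithmetic check: the variable names are ours, and the losses are read as absolute differences in accuracy, expressed in percentage points.

```python
# Hierarchy-kernel accuracies copied from Table 1 and Table 2.
with_known_ligands = {"Enzymes": 0.877, "GPCR": 0.817, "Channels": 0.857}
without_known_ligands = {"Enzymes": 0.862, "GPCR": 0.776, "Channels": 0.805}

# Absolute loss in accuracy between the two protocols, in percentage points.
loss_points = {
    dataset: round(
        100 * (with_known_ligands[dataset] - without_known_ligands[dataset]), 1
    )
    for dataset in with_known_ligands
}
print(loss_points)
```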



5 Discussion 

We propose a general method to combine the chemical and the biological space in a principled 
way and predict interaction between any small molecule and any target, which makes it a very 
valuable tool for drug discovery. The method provides a systematic representation of a ligand-target 
pair, including information on the interaction between the ligand and the target. Prediction is 
then performed by any machine learning algorithm (an SVM in our case) in the joint space, which 
lets targets with few known ligands benefit from the data points of similar targets, and which 
makes it possible to make predictions for targets with no known ligand. Our information sharing 
process therefore relies simply on a choice of description for the ligands, another for the targets, 
and classical machine learning methods: everything is done by casting the problem in a joint space, 
and no explicit procedure is needed to select which part of the information is shared. Since it 
subdivides the representation problem into two subproblems, our approach makes use of previous 
work on kernels for molecular graphs and kernels for biological targets. For the same reason, it 
will automatically benefit from future improvements in both fields. This leaves plenty of room to 
increase the performance. 
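
The idea of casting the problem in a joint space can be sketched as the construction of a single Gram matrix over ligand-target pairs from one ligand kernel matrix and one target kernel matrix. The sketch assumes that the joint kernel is the product of the two kernels; the matrices, the pair list and all variable names are made up for illustration.

```python
import numpy as np

# Toy kernel matrices (made-up values): 3 ligands and 2 targets.
k_ligand = np.array([
    [1.0, 0.6, 0.1],
    [0.6, 1.0, 0.3],
    [0.1, 0.3, 1.0],
])
k_target = np.array([
    [1.0, 0.5],
    [0.5, 1.0],
])

# Each training example is a (ligand index, target index) pair.
pairs = np.array([[0, 0], [1, 0], [1, 1], [2, 1]])
lig_idx, tar_idx = pairs[:, 0], pairs[:, 1]

# Joint Gram matrix: entry (i, j) multiplies the ligand similarity and the
# target similarity between pair i and pair j.
k_joint = k_ligand[np.ix_(lig_idx, lig_idx)] * k_target[np.ix_(tar_idx, tar_idx)]

print(k_joint.shape)                        # one row and column per pair
print(np.allclose(k_joint, k_joint.T))      # symmetric
print(np.linalg.eigvalsh(k_joint).min())    # smallest eigenvalue
```

A Gram matrix built this way can be passed to any kernel machine that accepts a precomputed kernel, for instance scikit-learn's `SVC(kernel="precomputed")`. Improving either the ligand kernel or the target kernel changes only the corresponding input matrix.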

Results on experimental ligand datasets show that using target kernels that share information 
across the targets considerably improves the prediction, especially in the case of targets with 
few known ligands. The improvement is particularly strong when the target kernel uses prior 
information on the structure between the targets, e.g., a hierarchy defined on a target class. Although 
sequence kernels did not give very good results in our experiments, we believe using the target 
sequence information could be an interesting alternative or complement to the hierarchy kernel. 
Further improvement could come from the use of kernels for structures in the cases where 3D 
structure information is available (e.g., for the enzymes, but not for the GPCR). Our method also shows 
good performance even when no ligand at all is known for a given target, which is excellent news 
since, on the one hand, classical ligand-based approaches fail to predict ligands for these targets 
and, on the other hand, docking approaches are computationally expensive and not feasible when the 
target 3D structure is unknown, which is the case for GPCR. 

In future work, it could be interesting to apply this framework to quantitative prediction of 
binding affinity using regression methods in the joint space. It would also be important to confirm 
predicted ligands experimentally or at least by docking approaches when the target 3D structure is 
available. 